Deriving spatial features from in situ proteomics imaging to enhance cancer survival analysis

Abstract Motivation Spatial proteomics data have been used to map cell states and improve our understanding of tissue organization. More recently, these methods have been extended to study the impact of such organization on disease progression and patient survival. However, to date, the majority of supervised learning methods utilizing these data types did not take full advantage of the spatial information, impacting their performance and utilization. Results Taking inspiration from ecology and epidemiology, we developed novel spatial feature extraction methods for use with spatial proteomics data. We used these features to learn prediction models for cancer patient survival. As we show, using the spatial features led to consistent improvement over prior methods that used the spatial proteomics data for the same task. In addition, feature importance analysis revealed new insights about the cell interactions that contribute to patient survival. Availability and implementation The code for this work can be found at gitlab.com/enable-medicine-public/spatsurv.


Cell expression normalization
We perform the following steps to normalize cell biomarker expression. 2. Normalize the expression value using quantile normalization and inverse sine transformation: where Q(0.2; X (j) ) represents the 20th quantile of X (j) , and arcsinh is the inverse hyperbolic sine function. We can denote the array of normalized expression values for biomarker j as {f (x (j) 2 ), ...} as f (X (j) ). We use these values for computing the transformed Ripley's K-function as described in the main Methods section. 3. Calculate the z-score of the normalized expression value: where µ and σ are the mean and standard deviation of f (X (j) ), respectively.
These normalized values are used to assign cell types (Supplementary Figure 2).

Ripley's isotropic correction estimator
where p(x i , d ij ) is the fraction of the length of a circle centered at x i with radius d ij that lies within the window. This correction method treats edge effects as a form of sampling bias and assumes that the point process is isotropic (statistically invariant under rotation).

Estimation of the mark-weighted K-function
The mark-weighted K-function [1] can be estimated usinĝ where m i and m j are the mark values for feature m for points i and j respectively. Similar to the K-function, the mark-weighted K-function can be transformed tô

Computing Patwa et al. biomarker interactions [2]
We created Voronoi tessellations using the cell centroids, with adjacent Voronoi regions representing interactions between cells. Given two interacting cells, we constructed two lists: List 1 contained the names of the proteins that Cell 1 was positive for, and List 2 contained the names of the proteins that Cell 2 was positive for, using the thresholds as defined previously. We took the Cartesian product of the two lists to find all of the combinations of proteins present in this interaction. For each sample, we tallied all the biomarker interactions between interacting cells into a biomarker-biomarker matrix. We took the top half of the symmetric matrix and flattened it to create a 780-length feature vector for each sample.

Random survival forest (RSF) details
There are two key differences between traditional random forests and RSFs: 1) the splitting rule at each node within a tree, and 2) the predicted value for each sample. To handle event censoring times, RSFs employ a splitting rule that maximizes the log-rank test statistic at each node [3]. The log-rank test statistic is traditionally used for two-sample testing of survival distributions. Instead of predicting the length of survival for a patient as one may expect with a traditional random forest for regression, a RSF predicts the ensemble mortality for each sample, which measures the expected number of deaths/events at the given time under a null hypothesis of similar survival behavior. We can define d l,h and Y l,h to be the number of deaths and individuals, respectively, at time T l,h . h is a unique terminal node in the survival tree, and l denotes the distinct event times in the data. x i is the d-dimensional covariate for sample i. Due to the binary nature of survival trees, each x i falls into a unique terminal node h. In each survival tree in the forest, the cumulative hazard function (CHF) for sample i can be estimated with The ensemble CHF, H e (T |x i ) can be estimated by taking the average of H(T |x i ) across all survival trees in the forest. The ensemble mortality, or ensemble risk score, can then be computed with where n denotes the total number of distinct death times in the training data. In this work, we used RandomSurvivalForest from the Python package scikit-survival for the RSF implementation [4]. For predicting risk scores for each sample, we used the RandomSurvivalForest.predict function.

Concordance index
The concordance indexĉ [5] is computed as follows, where C is the number of concordant pairs, D is the number of discordant pairs, and R is the number of pairs with the same risk value. Pairs are only comparable if both patients experience an event (not censored), or the patient with the shorter survival time experienced an event. Pairs that do not satisfy either of these criteria are not considered in the concordance index.   biomarker expression by cell type    Average concordance index of combinations of non-spatial and spatial feature sets Figure 5: Average concordance index for random survival forest models trained using 100 different random seeds, with combinations of non-spatial and spatial feature sets. Concordance indices are presented with 95% confidence intervals. Orange: using non-spatial features, purple: combining non-spatial features with neighborhood matrix features, gold: combining non-spatial features with K-function features at r = 30, 80 µm, blue: combining neighborhood matrix and K-function features at r = 30, 80 µm.  celltype interaction pairs with significant differences between low-and high-risk cohorts Figure 6: Heatmaps showing cell type interaction pairs with significant differences in their neighborhood fraction values between low-and high-risk cohorts. Neighborhood fraction values are computed using k = 10 (Methods). The left heatmap shows significant pairs when values are greater in the low-risk than the high-risk cohort, and vice versa for the right heatmap. Each heatmap element h ij corresponds to neighborhood matrix element m ij , the fraction of celltype j within the k-nearest neighbors of celltype i, averaged across all cells of type i in the sample.